Creating new pipeline using Seurat v4.0.2, available 2021-06-23.
Important notes:
Calculating percent.mt, but NOT regressing on percent.mt; regressing on nCount_RNA and nFeature_RNA.
knitr::opts_knit$set(root.dir = "~/Desktop/10XGenomicsData/msAggr_scRNASeq/IndividualPops/")
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(Seurat)
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Attaching SeuratObject
library(patchwork)
library(ggplot2)
library(clustree)
Loading required package: ggraph
source("~/Desktop/10XGenomicsData/msAggr_scRNASeq/RFunctions/read_10XGenomics_data.R")
source("~/Desktop/10XGenomicsData/msAggr_scRNASeq/RFunctions/PercentVariance.R")
source("~/Desktop/10XGenomicsData/msAggr_scRNASeq/RFunctions/ColorPalette.R")
source("~/Desktop/10XGenomicsData/msAggr_scRNASeq/RFunctions/Mouse2Human_idconversion.R")
ScaleData
You can also normalize and scale data for the RNA assay. There are numerous resources on this, but Aaron Lun explains well why the original log-normalized values should be used for DE and visualizations of expression, here: https://bioconductor.org/packages/3.10/workflows/vignettes/simpleSingleCell/inst/doc/batch.html#62_for_gene-based_analyses

> For gene-based procedures like differential expression (DE) analyses or gene network construction, it is desirable to use the original log-expression values or counts. The corrected values are only used to obtain cell-level results such as clusters or trajectories. Batch effects are handled explicitly using blocking terms or via a meta-analysis across batches. We do not use the corrected values directly in gene-based analyses, for various reasons:
>
> It is usually inappropriate to perform DE analyses on batch-corrected values, due to the failure to model the uncertainty of the correction. This usually results in loss of type I error control, i.e., more false positives than expected.
>
> The correction does not preserve the mean-variance relationship. Applications of common DE methods like edgeR or limma are unlikely to be valid.
>
> Batch correction may (correctly) remove biological differences between batches in the course of mapping all cells onto a common coordinate system. Returning to the uncorrected expression values provides an opportunity for detecting such differences if they are of interest. Conversely, if the batch correction made a mistake, the use of the uncorrected expression values provides an important sanity check.

In addition, the normalized values in SCT and integrated assays don’t necessarily correspond to per-gene expression values anyway, rather containing residuals (in the case of the scale.data slot for each).
TODO: work out how to load the 4 cell populations into a single Seurat object.
TODO: set a random seed for reproducibility!
projectName <- "CMPSubpop"
jackstraw.dim <- 40
sessionInfo.filename <- paste0(projectName, "_sessionInfo.txt")
sink(sessionInfo.filename)
sessionInfo()
sink()
setwd("~/Desktop/10XGenomicsData/cellRanger/") # temporarily changing wd only works if you run the entire chunk at once
data_file.list <- read_10XGenomics_data(sample.list = c("CMPm2"))
data.object <- Read10X(data_file.list)
seurat.object <- CreateSeuratObject(counts = data.object, min.cells = 3, min.features = 200, project = projectName) # min.genes was renamed min.features in Seurat v3+
Clean up to free memory
remove(data.object)
Add mitochondrial metadata and plot some basic features
seurat.object[["percent.mt"]] <- PercentageFeatureSet(seurat.object, pattern = "^mt-")
VlnPlot(seurat.object, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = 0, fill.by = 'orig.ident')
plot1 <- FeatureScatter(seurat.object, feature1 = "nCount_RNA", feature2 = "percent.mt", group.by = "orig.ident", pt.size = 0.01)
plot2 <- FeatureScatter(seurat.object, feature1 = "nCount_RNA", feature2 = "nFeature_RNA", group.by = "orig.ident", pt.size = 0.01)
plot1 + plot2
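For reference, PercentageFeatureSet with pattern "^mt-" is just the share of each cell's total counts coming from mitochondrial genes. A minimal sketch of that computation on a toy genes-by-cells matrix (Python; gene names and counts are invented):

```python
import numpy as np

# Toy counts matrix: rows = genes, columns = cells
genes = ["mt-Nd1", "mt-Co1", "Actb", "Gapdh"]
counts = np.array([
    [5, 0],    # mt-Nd1
    [5, 10],   # mt-Co1
    [40, 60],  # Actb
    [50, 30],  # Gapdh
])

# Percent of each cell's counts from genes matching "^mt-"
mito = np.array([g.startswith("mt-") for g in genes])
percent_mt = 100 * counts[mito].sum(axis=0) / counts.sum(axis=0)
```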
Remove low-quality cells. Require: nFeature_RNA between 200 and 4000 (inclusive); require: percent.mt <= 5.
print(paste("original object:", nrow(seurat.object@meta.data), "cells", sep = " "))
[1] "original object: 12540 cells"
seurat.object <- subset(seurat.object,
subset = nFeature_RNA >=200 &
nFeature_RNA <= 4000 &
percent.mt <= 5
)
print(paste("new object:", nrow(seurat.object@meta.data), "cells", sep = " "))
[1] "new object: 12059 cells"
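The subset above keeps only cells passing all three thresholds. The same boolean logic on toy per-cell QC vectors (Python; the numbers are invented):

```python
import numpy as np

# Invented per-cell QC metrics for four cells
n_feature = np.array([150, 500, 4500, 3000])
percent_mt = np.array([2.0, 6.0, 1.0, 3.0])

# Keep cells with 200 <= nFeature_RNA <= 4000 and percent.mt <= 5
keep = (n_feature >= 200) & (n_feature <= 4000) & (percent_mt <= 5)
```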
Struggling to wrap my head around this one. It seems that SCTransform is best for batch correction, while NormalizeData and ScaleData are best for DGE. Several vignettes perform both.
From the FindVariableFeatures documentation, `selection.method` controls how to choose top variable features. Choose one of:
vst: First, fits a line to the relationship of log(variance) and log(mean) using local polynomial regression (loess). Then standardizes the feature values using the observed mean and expected variance (given by the fitted line). Feature variance is then calculated on the standardized values after clipping to a maximum (see the clip.max parameter).
mean.var.plot (mvp): First, uses a function to calculate average expression (mean.function) and dispersion (dispersion.function) for each feature. Next, divides features into num.bin (default 20) bins based on their average expression, and calculates z-scores for dispersion within each bin. The purpose of this is to identify variable features while controlling for the strong relationship between variability and average expression.
dispersion (disp): selects the genes with the highest dispersion values.
seurat.object <- NormalizeData(seurat.object, normalization.method = "LogNormalize", scale.factor = 10000)
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
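LogNormalize divides each cell's counts by that cell's total, multiplies by the scale factor (10,000 here), and takes log1p. A toy check of the formula (Python; counts are invented):

```python
import numpy as np

# Toy genes x cells counts
counts = np.array([[10.0, 0.0],
                   [90.0, 100.0]])
scale_factor = 10_000

totals = counts.sum(axis=0)                       # per-cell library size
norm = np.log1p(counts / totals * scale_factor)   # LogNormalize
```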
Find variable features
seurat.object <- FindVariableFeatures(seurat.object, selection.method = "vst", nfeatures = 2000)
Calculating gene variances
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
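For intuition on the vst standardization described above: each gene's values are z-scored against its mean and an expected sd, clipped to a maximum, and the variance of the result is the ranking statistic. A sketch for a single gene (Python; the real vst gets the expected sd from a loess fit of log(variance) vs log(mean) across all genes — here I substitute the observed sd, so the standardized variance comes out to 1 by construction):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(5, size=1000).astype(float)  # one gene's counts across 1000 cells

mu = x.mean()
# Stand-in for the loess-fitted expected sd (assumption for this sketch)
sd_expected = x.std()
clip_max = np.sqrt(x.size)                   # clip.max default is sqrt(n cells)
z = np.clip((x - mu) / sd_expected, -clip_max, clip_max)
var_std = z.var()                            # variance of standardized, clipped values
```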
top10 <- head(VariableFeatures(seurat.object), 10)
plot1 <- VariableFeaturePlot(seurat.object)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
When using repel, set xnudge and ynudge to 0 for optimal results
plot1 + plot2
Scale data (linear transformation)
all.genes <- rownames(seurat.object)
seurat.object <- ScaleData(seurat.object, features = all.genes, vars.to.regress = c("nCount_RNA", "nFeature_RNA"))
Regressing out nCount_RNA, nFeature_RNA
# save.image(file = paste0(projectName, '.RData'))
saveRDS(seurat.object, file = paste0(projectName, "_raw.RDS"))
linear dimensional reduction. Default are based on VariableFeatures, but can be changed
seurat.object <- RunPCA(seurat.object, features = VariableFeatures(object = seurat.object))
PC_ 1
Positive: Vamp5, Nkg7, Car2, Apoe, Ctla2a, Gata2, Cd63, Rps23, Rps2, Itga2b
Adgrg1, Sdsl, Nrgn, Hmgb3, Pdcd4, Pf4, F2r, Angpt1, Hsp90ab1, Rps17
H2-Q7, Gp5, Aqp1, Cavin2, Fyb, Tmem40, Icam4, Mfsd2b, Ifitm3, Lat
Negative: Lgals3, Aif1, Id2, Cst3, H2-Aa, Cd74, Plbd1, Irf8, H2-Eb1, Ctsh
Ccr2, Ifi205, Tmsb10, Batf3, Psap, Cd52, Ms4a6c, Lsp1, H2-Ab1, S100a6
Pld4, Naaa, Ctss, Tyrobp, Rab7b, Ckb, Ifi30, Mpeg1, H2-DMb1, Ly86
PC_ 2
Positive: Prtn3, Ctsg, H2afy, Mpo, Emb, Ccl9, Plac8, Cd34, BC035044, Pgam1
Ramp1, Mif, Fkbp1a, Phgdh, Serpinb1a, Pkm, Bex6, Cd53, Tyrobp, Ung
Sell, Cdca7, Nme2, Calr, Hsp90ab1, Serpinf1, Anxa3, Limd2, Rps23, Clec12a
Negative: Ube2c, Cenpf, Nusap1, Mki67, Cenpa, Hmmr, Birc5, Prc1, Ckap2l, Cdca8
Ccnb1, H2afx, Cks2, Plk1, Cd9, Apoe, Tpx2, Cdc20, Top2a, Ccnb2
Vamp5, Cenpe, Nucks1, Kif22, Mfsd2b, Lockd, Hist1h2ap, Pf4, Cavin2, Cdkn2d
PC_ 3
Positive: Pf4, Tmsb4x, Cavin2, Serpine2, Rap1b, Treml1, Vwf, Rab27b, F2rl2, Pbx1
Gucy1a1, Cd9, Ehd3, Gp1bb, Itga2b, Slc14a1, Mmrn1, Gp9, Timp3, Plek
Nrgn, Unc119, Rgs10, Gm10419, Ptgs1, Mpl, Ache, Trpc6, Lims1, Ppbp
Negative: Plac8, Cks2, Cenpa, Ube2c, Tubb4b, Arl6ip1, Hmgb2, Hmmr, Sox4, Ccnb2
Tubb5, Cdc20, Nusap1, Ccnb1, Cenpf, Fos, Ramp1, Cdca8, Hist1h2ap, Stmn1
Hist1h2ac, Vim, Tent5a, Birc5, Aurka, Cenpe, Tpx2, Cdca3, Prc1, Cpa3
PC_ 4
Positive: Csrp3, Car1, Jun, Apoe, Jund, Ifngr1, Rnase6, Fos, Vim, Egr1
Ifitm1, Klf1, Tspo2, H2-Eb1, Ier2, Vamp5, Crip1, Cd74, Dusp1, Junb
Gstm5, H2-Aa, Plbd1, Itgb7, Gm15915, Id2, H2-Ab1, Lgals3, Aqp1, Blvrb
Negative: H2afz, Hmgn2, Birc5, Hmgb1, Tuba1b, Top2a, Hmgb2, Cdca8, Mki67, Pclaf
Cks1b, Spc24, Ppia, Mif, Pbk, Cdk1, Ckap2l, Lockd, Ran, Cdca3
Smc4, Hist1h1b, Nusap1, Pf4, Stmn1, Cenpf, Slpi, Cavin2, Hmmr, Smc2
PC_ 5
Positive: Pclaf, Tyms, Rrm2, Tk1, Pcna, Dut, Lig1, Ranbp1, Stmn1, Tuba1b
Gmnn, Hsp90aa1, Ptma, Slbp, Tipin, Dctpp1, Clspn, Dhfr, Rad51, Fen1
Spc24, Hist1h1b, Car1, Rad51ap1, Dtymk, Cycs, Rrm1, Uhrf1, Dek, Syce2
Negative: Hist1h2bc, Tsc22d1, Smim14, Ccnb2, Serpinb1a, Ccl9, Cd27, Cd34, Tmsb4x, Cenpa
Cdc20, Cdkn3, Prtn3, Gpr171, Gm19590, Glipr1, Satb1, Plek, Ccnb1, Ube2c
Rgs2, Lims1, Cenpf, Cd9, Ccl3, Bex6, Dhrs3, Hist1h1c, Ifi203, Mef2c
Plot results
VizDimLoadings(seurat.object, dims = 1:6, nfeatures = 10, reduction = "pca", ncol = 2)
DimPlot colored by orig.ident
DimPlot(seurat.object, reduction = "pca", group.by = "orig.ident")
Let’s put in a concerted effort to pick the right dimensionality using the newest software
# jackstraw.dim <- 40
# seurat.object <- JackStraw(seurat.object, num.replicate = 100, dims = jackstraw.dim) #runs ~50 min
# seurat.object <- ScoreJackStraw(seurat.object, dims = 1:jackstraw.dim)
# save.image(paste0(projectName, ".RData"))
Draw dim.reduction plots
# JackStrawPlot(seurat.object, dims = 25:36)
ElbowPlot(seurat.object, ndims = 50)
percent.variance(seurat.object@reductions$pca@stdev)
Number of PCs describing X% of variance
ElbowPlot(seurat.object, ndims = 50)
percent.variance(seurat.object@reductions$pca@stdev)
Number of PCs describing X% of variance (transcribed from the cell above because I don’t know how to freeze results)
Num pcs for 80% variance: 12 Num pcs for 85% variance: 18 Num pcs for 90% variance: 26 Num pcs for 95% variance: 37
Set total.var to 80%
tot.var <- percent.variance(seurat.object@reductions$pca@stdev, plot.var = FALSE, return.val = TRUE)
paste0("Num pcs for 80% variance: ", length(which(cumsum(tot.var) <= 80)))
[1] "Num pcs for 80% variance: 18"
paste0("Num pcs for 85% variance: ", length(which(cumsum(tot.var) <= 85)))
[1] "Num pcs for 85% variance: 25"
paste0("Num pcs for 90% variance: ", length(which(cumsum(tot.var) <= 90)))
[1] "Num pcs for 90% variance: 33"
paste0("Num pcs for 95% variance: ", length(which(cumsum(tot.var) <= 95)))
[1] "Num pcs for 95% variance: 41"
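The PC counts above come from cumulative percent variance. Assuming percent.variance (the custom helper sourced earlier) returns per-PC percent of total variance derived from the PCA standard deviations, the logic mirrors length(which(cumsum(tot.var) <= 80)). A toy version (Python; the stdevs are invented):

```python
import numpy as np

# Invented PCA standard deviations for 5 PCs
stdev = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
pct_var = 100 * stdev**2 / (stdev**2).sum()       # percent of total variance per PC
n_for_80 = int((np.cumsum(pct_var) <= 80).sum())  # PCs needed for 80% of variance
```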
Plot UMAP
tot.var <- percent.variance(seurat.object@reductions$pca@stdev, plot.var = FALSE, return.val = TRUE)
ndims <- length(which(cumsum(tot.var) <= 80))
seurat.object <- FindNeighbors(seurat.object, dims = 1:ndims)
Computing nearest neighbor graph
Computing SNN
seurat.object <- FindClusters(seurat.object, resolution = 0.5)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 401749
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8537
Number of communities: 11
Elapsed time: 3 seconds
seurat.object <- RunUMAP(seurat.object, dims = 1:ndims)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
18:16:37 UMAP embedding parameters a = 0.9922 b = 1.112
18:16:37 Read 12059 rows and found 18 numeric columns
18:16:37 Using Annoy for neighbor search, n_neighbors = 30
18:16:37 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
18:16:39 Writing NN index file to temp file /var/folders/4f/fwrj6fnn1dn4g8wsf0zv563hjsvl24/T//RtmplnWyat/file1481828295e68
18:16:39 Searching Annoy index using 1 thread, search_k = 3000
18:16:46 Annoy recall = 100%
18:16:46 Commencing smooth kNN distance calibration using 1 thread
18:16:48 Initializing from normalized Laplacian + noise
18:16:49 Commencing optimization for 200 epochs, with 488156 positive edges
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
18:16:59 Optimization finished
for(x in c(0.5, 1, 1.5, 2, 2.5)){
seurat.object <- FindClusters(seurat.object, resolution = x)
}
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 401749
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8537
Number of communities: 11
Elapsed time: 3 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 401749
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8051
Number of communities: 18
Elapsed time: 3 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 401749
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7743
Number of communities: 24
Elapsed time: 3 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 401749
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7514
Number of communities: 29
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 401749
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7321
Number of communities: 34
Elapsed time: 2 seconds
for (meta.col in colnames(seurat.object@meta.data)){
if(grepl(pattern = ("RNA_snn_res"), x = meta.col)==TRUE){
myplot <- DimPlot(seurat.object,
group.by = meta.col,
reduction = "umap",
cols = color.palette
) +
ggtitle(paste0(projectName, " dim", ndims, "res", gsub("RNA_snn_res", "", meta.col) ))
plot(myplot)
}
}
saveRDS(seurat.object, file = paste0(projectName, "_dim", ndims, ".RDS"))
Set total.var to 85%
clustree(seurat.object, prefix = "RNA_snn_res.", node_colour = "sc3_stability") +
scale_color_continuous(low = 'red3', high = 'white')
Warning: The `add` argument of `group_by()` is deprecated as of dplyr 1.0.0.
Please use the `.add` argument instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
Plot UMAP
tot.var <- percent.variance(seurat.object@reductions$pca@stdev, plot.var = FALSE, return.val = TRUE)
ndims <- length(which(cumsum(tot.var) <= 85))
seurat.object <- FindNeighbors(seurat.object, dims = 1:ndims)
Computing nearest neighbor graph
Computing SNN
seurat.object <- FindClusters(seurat.object, resolution = 0.5)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 424091
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8508
Number of communities: 9
Elapsed time: 3 seconds
seurat.object <- RunUMAP(seurat.object, dims = 1:ndims)
18:20:36 UMAP embedding parameters a = 0.9922 b = 1.112
18:20:36 Read 12059 rows and found 25 numeric columns
18:20:36 Using Annoy for neighbor search, n_neighbors = 30
18:20:36 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
18:20:38 Writing NN index file to temp file /var/folders/4f/fwrj6fnn1dn4g8wsf0zv563hjsvl24/T//RtmplnWyat/file1481833f51271
18:20:38 Searching Annoy index using 1 thread, search_k = 3000
18:20:45 Annoy recall = 100%
18:20:46 Commencing smooth kNN distance calibration using 1 thread
18:20:48 Initializing from normalized Laplacian + noise
18:20:48 Commencing optimization for 200 epochs, with 500658 positive edges
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
18:20:59 Optimization finished
for(x in c(0.5, 1, 1.5, 2, 2.5)){
seurat.object <- FindClusters(seurat.object, resolution = x)
}
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 424091
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8508
Number of communities: 9
Elapsed time: 3 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 424091
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8026
Number of communities: 18
Elapsed time: 3 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 424091
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7700
Number of communities: 22
Elapsed time: 3 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 424091
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7461
Number of communities: 28
Elapsed time: 3 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 424091
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7257
Number of communities: 33
Elapsed time: 3 seconds
for (meta.col in colnames(seurat.object@meta.data)){
if(grepl(pattern = ("RNA_snn_res"), x = meta.col)==TRUE){
myplot <- DimPlot(seurat.object,
group.by = meta.col,
reduction = "umap",
cols = color.palette
) +
ggtitle(paste0(projectName, " dim", ndims, "res", gsub("RNA_snn_res", "", meta.col) ))
plot(myplot)
}
}
We must ensure cluster stability: cells that start in the same cluster should tend to stay in the same cluster. If the data is over-clustered, cells will bounce between groups.
Following this tutorial by Matt O.: https://towardsdatascience.com/10-tips-for-choosing-the-optimal-number-of-clusters-277e93d72d92. Previously my favourite has been clustree, which gives a nice visual. NB: for some reason clustree::clustree() didn’t work, whereas library(clustree) followed by clustree() did.
saveRDS(seurat.object, file = paste0(projectName, "_dim", ndims, ".RDS"))
clustree(seurat.object, prefix = "RNA_snn_res.", node_colour = "sc3_stability") +
scale_color_continuous(low = 'red3', high = 'white')
Plot UMAP
tot.var <- percent.variance(seurat.object@reductions$pca@stdev, plot.var = FALSE, return.val = TRUE)
ndims <- length(which(cumsum(tot.var) <= 90))
seurat.object <- FindNeighbors(seurat.object, dims = 1:ndims)
Computing nearest neighbor graph
Computing SNN
seurat.object <- FindClusters(seurat.object, resolution = 0.5)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 452999
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8502
Number of communities: 9
Elapsed time: 3 seconds
seurat.object <- RunUMAP(seurat.object, dims = 1:ndims)
18:24:56 UMAP embedding parameters a = 0.9922 b = 1.112
18:24:56 Read 12059 rows and found 33 numeric columns
18:24:56 Using Annoy for neighbor search, n_neighbors = 30
18:24:56 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
18:24:58 Writing NN index file to temp file /var/folders/4f/fwrj6fnn1dn4g8wsf0zv563hjsvl24/T//RtmplnWyat/file148181e8dd35d
18:24:58 Searching Annoy index using 1 thread, search_k = 3000
18:25:05 Annoy recall = 100%
18:25:06 Commencing smooth kNN distance calibration using 1 thread
18:25:08 Initializing from normalized Laplacian + noise
18:25:09 Commencing optimization for 200 epochs, with 513294 positive edges
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
18:25:20 Optimization finished
for(x in c(0.5, 1, 1.5, 2, 2.5)){
seurat.object <- FindClusters(seurat.object, resolution = x)
}
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 452999
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8502
Number of communities: 9
Elapsed time: 3 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 452999
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7974
Number of communities: 17
Elapsed time: 4 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 452999
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7652
Number of communities: 22
Elapsed time: 3 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 452999
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7404
Number of communities: 27
Elapsed time: 4 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 452999
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7189
Number of communities: 30
Elapsed time: 3 seconds
for (meta.col in colnames(seurat.object@meta.data)){
if(grepl(pattern = "RNA_snn_res", x = meta.col)==TRUE | grepl(pattern = "orig.ident", x = meta.col)==TRUE){
myplot <- DimPlot(seurat.object,
group.by = meta.col,
reduction = "umap",
pt.size = 1,
cols = color.palette) +
ggtitle(paste0(projectName, " dim", ndims, "res.", gsub("RNA_snn_res.", "", meta.col) ))
plot(myplot)
png(filename = paste0(projectName, " dim", ndims, "res.", gsub("RNA_snn_res.", "", meta.col), "-umap.png"), height = 800, width = 800)
plot(myplot)
dev.off()
myplot <- DimPlot(seurat.object,
group.by = meta.col,
reduction = "umap",
pt.size = 1,
cols = color.palette) +
facet_wrap(meta.col) +
ggtitle(paste0(projectName, " dim", ndims, "res.", gsub("RNA_snn_res.", "", meta.col)))
png(filename = paste0(projectName, " dim", ndims, "res.", gsub("RNA_snn_res.", "", meta.col), "-umap_FacetRes.png"), height = 800, width = 800)
plot(myplot)
dev.off()
}
}
saveRDS(seurat.object, file = paste0(projectName, "_dim", ndims, ".RDS"))
clustree(seurat.object, prefix = "RNA_snn_res.", node_colour = "sc3_stability") +
scale_color_continuous(low = 'red3', high = 'white')
png(filename = paste0(projectName, "_dim", ndims, "-clustree.png"), height = 800, width = 1600)
clustree(seurat.object, prefix = "RNA_snn_res.", node_colour = "sc3_stability") +
scale_color_continuous(low = 'red3', high = 'white')
dev.off()
quartz_off_screen
2
Plot UMAP
tot.var <- percent.variance(seurat.object@reductions$pca@stdev, plot.var = FALSE, return.val = TRUE)
ndims <- length(which(cumsum(tot.var) <= 95))
seurat.object <- FindNeighbors(seurat.object, dims = 1:ndims)
Computing nearest neighbor graph
Computing SNN
seurat.object <- FindClusters(seurat.object, resolution = 0.5)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 478535
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8472
Number of communities: 9
Elapsed time: 4 seconds
seurat.object <- RunUMAP(seurat.object, dims = 1:ndims)
18:30:03 UMAP embedding parameters a = 0.9922 b = 1.112
18:30:03 Read 12059 rows and found 41 numeric columns
18:30:03 Using Annoy for neighbor search, n_neighbors = 30
18:30:03 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
18:30:05 Writing NN index file to temp file /var/folders/4f/fwrj6fnn1dn4g8wsf0zv563hjsvl24/T//RtmplnWyat/file14818712f8e56
18:30:05 Searching Annoy index using 1 thread, search_k = 3000
18:30:13 Annoy recall = 100%
18:30:13 Commencing smooth kNN distance calibration using 1 thread
18:30:16 Initializing from normalized Laplacian + noise
18:30:17 Commencing optimization for 200 epochs, with 524686 positive edges
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
18:30:28 Optimization finished
for(x in c(0.5, 1, 1.5, 2, 2.5)){
seurat.object <- FindClusters(seurat.object, resolution = x)
}
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 478535
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8472
Number of communities: 9
Elapsed time: 4 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 478535
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7908
Number of communities: 17
Elapsed time: 4 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 478535
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7591
Number of communities: 20
Elapsed time: 4 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 478535
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7332
Number of communities: 26
Elapsed time: 3 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 12059
Number of edges: 478535
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7121
Number of communities: 32
Elapsed time: 4 seconds
for (meta.col in colnames(seurat.object@meta.data)) {
  if (grepl("RNA_snn_res", meta.col)) {
    myplot <- DimPlot(seurat.object,
                      group.by  = meta.col,
                      reduction = "umap",
                      cols      = color.palette) +
      ggtitle(paste0(projectName, " dim", ndims, "res",
                     gsub("RNA_snn_res", "", meta.col)))
    plot(myplot)
  }
}
saveRDS(seurat.object, file = paste0(projectName, "_dim", ndims, ".RDS"))
clustree(seurat.object, prefix = "RNA_snn_res.", node_colour = "sc3_stability") +
scale_color_continuous(low = 'red3', high = 'white')
# Incomplete in the original notebook: the probe-list path and the
# expression-matrix assignment were never filled in.
# probe.list <- read.table()
# xprsn.mtx <-
I think the clusters from 80% variance at 0.5 and 1.0 resolution are the most stable. We'll do some statistics and DGE on those. We will also need to go back and experiment with SCTransform, since these are multiple cell types from multiple lanes.

## Load favourite dim reduction file
rds.file <- ""
seurat.object <- readRDS(rds.file)
ndims <- as.numeric(gsub("[^0-9]", "", stringr::str_split(rds.file, "_")[[1]][3]))
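The `str_split`/`gsub` line above assumes an underscore-delimited file name; a hypothetical example (file name invented for illustration):

```r
# Hypothetical file name for illustration only. Note that [[1]][3] only works
# if the name has at least three underscore-separated parts (e.g. a project
# name that itself contains an underscore); adjust the index otherwise.
rds.file <- "msAggr_LSK_dim30.RDS"
parts    <- stringr::str_split(rds.file, "_")[[1]]    # "msAggr" "LSK" "dim30.RDS"
ndims    <- as.numeric(gsub("[^0-9]", "", parts[3]))  # 30
```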
object.res <- ".0.5"
Idents(seurat.object) <- paste0("RNA_snn_res", object.res)
length(levels(seurat.object@active.ident))
# Number of filtered cells left in each pop
sapply(c("LSKm2", "CMPm2", "MEPm", "GMPm"),
       function(x) sum(seurat.object@meta.data$orig.ident == x))
par(mfrow = c(2, 2))
for (x in c("LSKm2", "CMPm2", "MEPm", "GMPm")) {
  h <- hist(seurat.object@meta.data[seurat.object@meta.data$orig.ident == x, "percent.mt"],
            breaks = 30, plot = FALSE)
  h$density <- h$counts / sum(h$counts) * 100  # express bin counts as percent of cells
  plot(h, freq = FALSE, main = paste(x, "percent mitoC"),
       xlab = "percent mitoC", ylab = "Percent of cells")
}
par(mfrow = c(1, 1))
VlnPlot(subset(seurat.object, subset = orig.ident == "MEPm"),
        features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
        ncol = 1, pt.size = 0, fill.by = "ident", flip = TRUE)
Start by comparing all clusters against all other clusters and writing out the cluster info: calculate FindAllMarkers() for the different idents and save the results to a new file.
# ident.list <- colnames(seurat.object@meta.data)[grepl("^RNA_snn", colnames(seurat.object@meta.data))]
ident.list <- c("RNA_snn_res.0.5", "RNA_snn_res.1")
for (tested.ident in ident.list) {
  Idents(seurat.object) <- tested.ident
  all.markers <- FindAllMarkers(seurat.object)
  xlsx::write.xlsx(x = all.markers[, c("avg_log2FC", "p_val_adj", "cluster", "gene")],
                   file = paste0(projectName, "_FindALLMarkers_dim", ndims, "_allres.xlsx"),
                   sheetName = tested.ident,
                   col.names = TRUE,
                   row.names = FALSE,
                   append = TRUE)
}
## FindAllMarkers() lists for GSEA

object.res.allmarkers <- FindAllMarkers(seurat.object)
Mouse2HumanTable <- Mouse2Human(object.res.allmarkers$gene)
HGNC <- Mouse2HumanTable$HGNC[match(object.res.allmarkers$gene, Mouse2HumanTable$MGI)]
head(object.res.allmarkers)
object.res.allmarkers$HGNC <- HGNC
tail(object.res.allmarkers)
sig.res <- object.res.allmarkers[object.res.allmarkers$p_val_adj <= 0.05, ]
sig.res <- sig.res[c("avg_log2FC", "HGNC", "cluster")]
sig.res <- sig.res[!(sig.res$HGNC == "" | is.na(sig.res$HGNC)),] # GSEA will fail if there are any blanks or NAs in the table
for (cluster in unique(sig.res$cluster)) {
  print(paste("writing cluster", cluster))
  new.table <- sig.res[sig.res$cluster == cluster, c("HGNC", "avg_log2FC")]
  new.table <- new.table[order(-new.table$avg_log2FC), ]
  dir.create(paste0("RankList_res", object.res, "_findAll_hgnc/"), showWarnings = FALSE)
  write.table(new.table,
              file = paste0("RankList_res", object.res, "_findAll_hgnc/res",
                            object.res, "cluster", cluster, ".rnk"),
              quote = FALSE, row.names = FALSE, col.names = TRUE, sep = "\t")
}
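For reference, each resulting .rnk file is a two-column, tab-separated rank list (the gene symbols and values below are invented for illustration). Note that GSEAPreranked conventionally expects no header row; if GSEA rejects the file, write it with `col.names = FALSE`, or prefix the header line with `#` so it is treated as a comment:

```
HGNC	avg_log2FC
ELANE	2.41
MPO	1.87
GATA1	-1.12
```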
Calculate FindMarkers() lists that distinguish each cluster from the rest (markers might overlap between clusters).
# ident.list <- colnames(seurat.object@meta.data)[grepl("^RNA_snn", colnames(seurat.object@meta.data))]
ident.list <- c("RNA_snn_res.0.5", "RNA_snn_res.1")
for (tested.ident in ident.list) {
  Idents(seurat.object) <- tested.ident  # ensure FindMarkers() tests against this ident
  for (cluster in sort(as.numeric(levels(seurat.object@meta.data[[tested.ident]])))) {
    cluster.markers <- FindMarkers(seurat.object, ident.1 = cluster)
    xlsx::write.xlsx(x = cluster.markers[, c("avg_log2FC", "p_val_adj")],
                     file = paste0(projectName, "_FindMarkers_dim", ndims,
                                   gsub("RNA_snn_", "", tested.ident), ".xlsx"),
                     sheetName = paste0("clst", cluster),
                     col.names = TRUE,
                     row.names = TRUE,
                     append = TRUE)
  }
}
# Note: [[ ]] is required here; single-bracket indexing of meta.data returns a
# data.frame, whose levels() is NULL, so the loop would silently run zero times.
for (cluster in sort(as.numeric(levels(seurat.object@meta.data[[paste0("RNA_snn_res", object.res)]])))) {
  cluster.markers <- FindMarkers(seurat.object, ident.1 = cluster)
  xlsx::write.xlsx(x = cluster.markers[, c("avg_log2FC", "p_val_adj")],
                   file = paste0(projectName, "_FindMarkers_dim", ndims, "res", object.res, ".xlsx"),
                   sheetName = paste0("clst", cluster),
                   col.names = TRUE,
                   row.names = TRUE,
                   append = TRUE)
}
Cluster stability could be influenced by:

* cells in each population (cellranger v6 includes more cells than cellranger v1, especially in MEP)
* dimensionality is incorrect
* ScaleData didn't account for regression factors (e.g., "nCount_RNA" or "nFeature_RNA")
* did not consider cell cycle
* incorrect normalization/scaling method
* clustering is too strict or not strict enough
* neighborhood analysis used wrong parameters
* should include a mitoC filter (there's a chunk of MEP w/ mitoC @ ~40%)
* SCTransform accounts better for sources of variability
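A hedged sketch of the SCTransform route raised above, folding in the mitoC filter and cell-cycle points (the cutoff is an example, not a recommendation; `cc.genes` ships with Seurat but contains human symbols, so for these mouse data the lists would first need converting, e.g. via the sourced Mouse2Human helper):

```r
# Sketch only: an alternative normalization pass addressing several of the
# points above. SCTransform() replaces NormalizeData()/ScaleData().
seurat.object <- subset(seurat.object, subset = percent.mt < 20)  # example cutoff
seurat.object <- CellCycleScoring(seurat.object,
                                  s.features   = cc.genes$s.genes,
                                  g2m.features = cc.genes$g2m.genes)
seurat.object <- SCTransform(seurat.object,
                             vars.to.regress = c("percent.mt", "S.Score", "G2M.Score"))
seurat.object <- RunPCA(seurat.object)
```

Per the note in the header, the SCT assay's values are residuals, so DE and expression visualization should still be run on the log-normalized RNA assay.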